ABSTRACT

Predicting the occurrence of links is a fundamental problem in net- 
works. In the link prediction problem we are given a snapshot of a 
network and would like to infer which interactions among existing 
members are likely to occur in the near future or which existing
interactions we are missing. Although this problem has been
extensively studied, the challenge of how to effectively combine the
information from the network structure with rich node and edge 
attribute data remains largely open. 

We develop an algorithm based on Supervised Random Walks 
that naturally combines the information from the network structure 
with node and edge level attributes. We achieve this by using these 
attributes to guide a random walk on the graph. We formulate a 
supervised learning task where the goal is to learn a function that 
assigns strengths to edges in the network such that a random walker 
is more likely to visit the nodes to which new links will be created 
in the future. We develop an efficient training algorithm to directly 
learn the edge strength estimation function. 

Our experiments on the Facebook social graph and large collab- 
oration networks show that our approach outperforms state-of-the- 
art unsupervised approaches as well as approaches that are based 
on feature extraction. 

Categories and Subject Descriptors: H.2.8 [Database Manage- 
ment]: Database applications — Data mining 

General Terms: Algorithms; Experimentation. 

Keywords: Link prediction, Social networks 

1. INTRODUCTION 

Large real-world networks exhibit a range of interesting properties
and patterns [7, 20]. One of the recurring themes in this line of
research is to design models that predict and reproduce the emergence
of such network structures. Research then seeks to develop
models that will accurately predict the global structure of the
network [7, 20].

Many types of networks and especially social networks are highly
dynamic; they grow and change quickly through the additions of
new edges which signify the appearance of new interactions between
the nodes of the network. Thus, studying the networks at the
level of individual edge creations is also interesting and in some
respects more difficult than global network modeling. Identifying
the mechanisms by which such social networks evolve at the level
of individual edges is a fundamental question that is still not well
understood, and it forms the motivation for our work here.

Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.

WSDM'11, February 9-12, 2011, Hong Kong, China.
Copyright 2011 ACM 978-1-4503-0493-1/11/02 ...$10.00.

We consider the classical problem of link prediction [21] where
we are given a snapshot of a social network at time t, and we seek
to accurately predict the edges that will be added to the network
during the interval from time t to a given future time t'. More
concretely, we are given a large network, say Facebook, at time t and
for each user we would like to predict what new edges (friendships)
that user will create between t and some future time t'. The problem
can be also viewed as a link recommendation problem, where
we aim to suggest to each user a list of people that the user is likely
to create new connections to.

The processes guiding link creation are of interest from more 
than a purely scientific point of view. The current Facebook system 
for suggesting friends is responsible for a significant fraction of link 
creations, and adds value for Facebook users. By making better 
predictions, we will be able to increase the usage of this feature, 
and make it more useful to Facebook members. 

Challenges. The link prediction and link recommendation prob- 
lems are challenging from at least two points of view. First, real 
networks are extremely sparse, i.e., nodes have connections to only 
a very small fraction of all nodes in the network. For example, in 
the case of Facebook a typical user is connected to about 100 out 
of more than 500 million nodes of the network. Thus, a very good 
(but unfortunately useless) way to predict edges is to predict no new 
edges since this achieves near perfect predictive accuracy (i.e., out 
of 500 million possible predictions it makes only 100 mistakes). 

The second challenge is more subtle; to what extent can the links 
of the social network be modeled using the features intrinsic to the 
network itself? Similarly, how do characteristics of users (e.g., age, 
gender, home town) interact with the creation of new edges? Con- 
sider the Facebook social network, for example. There can be many 
reasons exogenous to the network for two users to become con- 
nected: it could be that they met at a party, and then connected on 
Facebook. However, since they met at a party they are likely to be 
about the same age, and they also probably live in the same town. 
Moreover, this link might also be hinted at by the structure of the 
network: two people are more likely to meet at the same party if 
they are "close" in the network. Such a pair of people likely has
friends in common, and travels in similar social circles. Thus,
despite the fact that they became friends due to the exogenous event
(i.e., a party) there are clues in their social networks which suggest 
a high probability of a future friendship. 

Thus the question is how network and node features interact
in the creation of new links. From the link creation point of view:
how important is it to have common interests and characteristics?
Furthermore, how important is it to be in the same social circle and
be "close" in the network in order to eventually connect? From the
technical point of view it is not clear how to develop a method that, 
in a principled way, combines the features of nodes (i.e., user pro- 
file information) and edges (i.e., interaction information) with the 
network structure. A common, but somewhat unsatisfactory, ap- 
proach is to simply extract a set of features describing the network 
structure (like node degree, number of common friends, shortest 
path length) around the two nodes of interest and combine it with 
the user profile information. 

Present work: Supervised Random Walks. To address these 
challenges we develop a method for both link prediction and link 
recommendation. We develop a concept of Supervised Random 
Walks that naturally and in a principled way combines the network 
structure with the characteristics (attributes, features) of nodes and 
edges of the network into a unified link prediction algorithm. 

We develop a method based on Supervised Random Walks that in 
a supervised way learns how to bias a PageRank-like random walk 
on the network [3] so that it visits given nodes (i.e., positive 
training examples) more often than the others. 

We achieve this by using node and edge features to learn edge 
strengths (i.e., random walk transition probabilities) such that the 
random walk on such a weighted network is more likely to visit
"positive" than "negative" nodes. In the context of link prediction, 
positive nodes are nodes to which new edges will be created in the 
future, and negative are all other nodes. We formulate a supervised 
learning task where we are given a source node s and training ex- 
amples about which nodes s will create links to in the future. The 
goal is to then learn a function that assigns a strength (i.e., random 
walk transition probability) to each edge so that when computing 
the random walk scores in such a weighted network nodes to which 
s creates new links have higher scores to s than nodes to which s 
does not create links. 

From a technical perspective, we show that such an edge strength
function can be learned directly and efficiently. This means that
we do not postulate what it means for an edge to be "strong" in an
ad-hoc way and then use this heuristic estimate. Rather, we show how
to directly find the parameters of the edge strength function which
give optimal performance. This means we are able to compute the
gradient of the PageRank-like random walk scores with respect to
the parameters of the edge strength function. The formulation
results in an optimization problem for which we derive an efficient
estimation procedure.

From the practical point of view, we experiment with large col- 
laboration networks and data from the Facebook network, show- 
ing that our approach outperforms state-of-the-art unsupervised ap- 
proaches as well as supervised approaches based on complex net- 
work feature extraction. An additional benefit of our approach is 
that no complex network feature extraction or domain expertise are 
necessary as our algorithm nicely combines the node attribute and 
network structure information. 

Applications and consequences. As networks evolve and grow by 
addition of new edges, the link prediction problem offers insights 
into the factors behind creation of individual edges as well as into 
network formation in general. 

Moreover, the link-prediction and the link-recommendation problems
are relevant to a number of interesting current applications of
social networks. First, for online social networking websites, like
Facebook and Myspace, being able to predict future interactions
has direct business consequences. More broadly, large organizations
can directly benefit from the interactions within the informal
social network among their members and link-prediction methods
can be used to suggest possible new collaborations and interactions
within the organization. Research in security has recently
recognized the role of social network analysis for this domain (e.g.,
terrorist networks). In this context link prediction can be used to
suggest the most likely links that may form in the future. Similarly,
link prediction can also be used for prediction of missing or
unobserved links in networks [9] or to suggest which individuals may
be working together even though their interaction has not yet been
directly observed. Applications go well beyond social networks, as
our techniques can be used to predict unobserved links in
protein-protein interaction networks in systems biology or give suggestions
to bloggers about which relevant pages on the Web to link to.

Furthermore, the framework we develop is more general than 
link prediction, and could be used for any sort of interaction. For 
instance, in a collaboration network, it could easily be used not to 
predict who s will link to next (write a paper with a previously 
un-collaborated-with person) but to predict who s will coauthor a 
paper with next, including all those with whom s has previously 
coauthored. 

Further related work. The link prediction problem in networks
comes in many flavors and variants. For example, the network
inference problem [13, 24] can be cast as a link prediction problem
where no knowledge of the network is given. Moreover, even models
of complex networks, like Preferential Attachment [7], Forest
Fire model [20] and models based on random walks [19, 8], can be
viewed as ways for predicting new links in networks.

The unsupervised methods for link prediction were extensively
evaluated by Liben-Nowell and Kleinberg [21] who found that the
Adamic-Adar measure of node similarity [1] performed best. More
recently approaches based on network community detection [9, 16]
have been tested on small networks. Link prediction in a supervised
machine learning setting was mainly studied by the relational
learning community [28, 26]. However, the challenge with these
approaches is primarily scalability.

Random walks on graphs have been considered for computing
node proximities in large graphs [31, 30, 29, 27]. They have also
been used for learning to rank nodes in graphs [3, 2, 23, 11].

2. SUPERVISED RANDOM WALKS 

Next we describe our algorithm for link prediction and recom- 
mendation. The general setting is that we are given a graph and a 
node s for which we would like to predict/recommend new links. 
The idea is that s has already created some links and we would like 
to predict which links it will create next (or will be created to it, 
since the direction of the links is often not clear). For simplicity 
the following discussion will focus on a single node s and how to 
predict the links it will create in the future. 

Note that our setting is much more general than it appears. We 
require that for a node s we are given a set of "positive" and "neg- 
ative" training nodes and our algorithm then learns how to distin- 
guish them. This can be used for link prediction (positive nodes are 
those to which links are created in the future), link recommendation
(positive nodes are those on which the user clicks), link anomaly
detection (positive nodes are those to which s has anomalous links) 
or missing link prediction (positive nodes are those to which s has 
missing links), to name a few. Moreover, our approach can also 
be generalized to a setting where prediction/recommendation is not 
being made for only a single node s but also for a group of nodes. 

General considerations. A first general approach to link prediction
would be to view it as a classification task. We take pairs
of nodes to which s has created edges as positive training examples,
and all other nodes as negative training examples. We then
learn a classifier that predicts where node s is going to create links.
There are several problems with such an approach. The first is the
class imbalance; s will create edges to a very small fraction of the
total nodes in the network and learning is particularly hard in
domains with high class imbalance. Second, extracting the features
that the learning algorithm would use is a challenging and cumbersome
task. Deciding which node features (e.g., node demographics
like age, gender, hometown) and edge features (e.g., interaction
activity) to use is already hard. However, it is even less clear how
to extract good features that describe the network structure and
patterns of connectivity between the pair of nodes under consideration.

Even in a simple undirected graph with no node/edge attributes, 
there are countless ways to describe the proximity of two nodes. 
For example, we might start by counting the number of common 
neighbors between the two nodes. We might then adjust the prox- 
imity score based on the degrees of the two nodes (with the intuition 
being that high-degree nodes are likely to have common neighbors 
by mere happenstance). We might go further, giving different
length-two paths different weights based on things like the centrality or
degree of the intermediate nodes. The possibilities are endless, and
extracting useful features is typically done by trial and error rather 
than any principled approach. The problem becomes even harder 
when annotations are added to edges. For instance, in many net- 
works we know the creation times of edges, and this is likely to be 
a useful feature. But how do we combine the creation times of all 
the edges to get a feature relevant to a pair of nodes? 

A second general approach to the link prediction problem is to
think about it as a task to rank the nodes of the network. The idea
is to design an algorithm that will assign higher scores to nodes
which s created links to than to those that s did not link to.
PageRank [25] and variants like Personalized PageRank [17, 15] and
Random Walks with Restarts [31] are popular methods for ranking
nodes on graphs. Thus, one simple idea would be to start a random
walk at node s and compute the proximity of each other node to
node s [30]. This can be done by setting the random jump vector
so that the walk only jumps back to s and thus restarts the walk.
The stationary distribution of such a random walk assigns each node
a score (i.e., a PageRank score) which gives us a ranking of how
"close" to the node s the other nodes in the network are. This method
takes advantage of the structure of the network but does not
consider the impact of other properties, like age, gender, and creation
time.
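To make the baseline concrete, the following is a minimal sketch of a random walk with restarts on an unweighted graph, computed by power iteration. The adjacency-list format, the toy graph, the function name and the default restart probability are illustrative assumptions rather than details taken from the paper, and the sketch assumes every node has at least one out-neighbor.

```python
def random_walk_with_restart(adj, s, alpha=0.15, tol=1e-12, max_iter=1000):
    """Stationary distribution of a walk that restarts at s with probability alpha.

    adj maps each node to a list of out-neighbors (hypothetical toy format);
    every node is assumed to have at least one out-neighbor.
    """
    nodes = list(adj)
    p = {u: 1.0 / len(nodes) for u in nodes}
    for _ in range(max_iter):
        nxt = {u: 0.0 for u in nodes}
        for u in nodes:
            nbrs = adj[u]
            for v in nbrs:
                # Follow an out-edge; all edges of u are equally likely.
                nxt[v] += (1.0 - alpha) * p[u] / len(nbrs)
            # With probability alpha the walk jumps back to s.
            nxt[s] += alpha * p[u]
        diff = max(abs(nxt[u] - p[u]) for u in nodes)
        p = nxt
        if diff < tol:
            break
    return p


# Toy undirected graph stored as symmetric adjacency lists.
adj = {
    "s": ["a", "b"],
    "a": ["s", "c"],
    "b": ["s", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}
scores = random_walk_with_restart(adj, "s")
# Rank only the nodes s is not yet connected to.
candidates = [u for u in adj if u != "s" and u not in adj["s"]]
ranking = sorted(candidates, key=scores.get, reverse=True)
```

In this sketch the ranking depends on the graph structure alone; node and edge attributes play no role, which is the limitation noted above.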

Overview of our approach. We combine the two above approaches 
into a single framework that will at the same time consider rich 
node and edge features as well as the structure of the network. As 
Random Walks with Restarts have proven to be a powerful tool for 
computing node proximities on graphs we use them as a way to 
consider the network structure. However, we then use the node and 
edge attribute data to bias the random walk so that it will more often 
visit nodes to which s creates edges in the future. 

More precisely, we are given a source node s. Then we are also
given a set of destination nodes $d_1, \ldots, d_k \in D$ to which s will
create edges in the near future. Now, we aim to bias the random
walk originating from s so that it will visit nodes $d_i$ more often
than other nodes in the network. One way to bias the random walk
is to assign each edge a random walk transition probability (i.e.,
strength). Whereas the traditional PageRank assumes the transition
probabilities of all edges to be the same, we learn how to assign
each edge a transition probability so that the random walk is
more likely to visit target nodes $d_i$ than other nodes of the network.
However, directly setting an arbitrary transition probability to each
edge would make the task trivial, and would result in drastic
overfitting. Thus, we aim to learn a model (a function) that will assign
the transition probability for each edge (u, v) based on features of
nodes u and v, as well as the features of the edge (u, v). The question
we address next is how to directly and in a principled way
estimate the parameters of such a random walk biasing function.

Problem formulation. We are given a directed graph G(V, E), a
node s and a set of candidates to which s could create an edge.
We label nodes to which s creates edges in the future as destination
nodes $D = \{d_1, \ldots, d_k\}$, while we call other nodes to which
s does not create edges no-link nodes $L = \{l_1, \ldots, l_n\}$. We label
candidate nodes with a set $C = \{c_i\} = D \cup L$. We think of
nodes in D as positive and nodes in L as negative training examples.
Later we generalize to multiple instances of s, L and D. Each
node and each edge in G is further described with a set of features.
We assume that each edge (u, v) has a corresponding feature vector
$\psi_{uv}$ that describes the nodes u and v (e.g., age, gender, hometown)
and the interaction attributes (e.g., when the edge was created, how
many messages u and v exchanged, or how many photos they
appeared together in).

For edge (u, v) in G we compute the strength $a_{uv} = f_w(\psi_{uv})$.
Function $f_w$ parameterized by w takes the edge feature vector $\psi_{uv}$
as input and computes the corresponding edge strength $a_{uv}$ that
models the random walk transition probability. It is exactly the
function $f_w(\psi)$ that we learn in the training phase of the algorithm.

To predict new edges of node s, first edge strengths of all edges
are calculated using $f_w$. Then a random walk with restarts is run
from s. The stationary distribution p of the random walk assigns
each node u a probability $p_u$. Nodes are ordered by $p_u$ and top
ranked nodes are then predicted as destinations of future links of s.

Now our task is to learn the parameters w of function $f_w(\psi_{uv})$
that assigns each edge a transition probability $a_{uv}$. One can think
of the weights $a_{uv}$ as edge strengths: the random walk is more
likely to traverse edges of high strength, and thus nodes connected
to node s via paths of strong edges will likely be visited by the
random walk and will thus rank higher.

The optimization problem. The training data contains information
that source node s will create edges to nodes $d \in D$ and not
to nodes $l \in L$. So, we aim to set the parameters w of function
$f_w(\psi_{uv})$ so that it will assign edge weights $a_{uv}$ in such a way that
the random walk will be more likely to visit nodes in D than L, i.e.,
$p_l < p_d$, for each $d \in D$ and $l \in L$.

Thus, we define the optimization problem to find the optimal set
of parameters w of edge strength function $f_w(\psi_{uv})$ as follows:

$$\min_w F(w) = \|w\|^2 \quad \text{such that} \quad \forall\, d \in D,\ l \in L:\ p_l < p_d \tag{1}$$

where p is the vector of PageRank scores. Note that PageRank
scores $p_i$ depend on edge strengths $a_{uv}$ and thus actually depend
on $f_w(\psi_{uv})$ that is parameterized by w. The idea here is that we
want to find the parameter vector w such that the PageRank scores
of nodes in D will be greater than the scores of nodes in L. We
prefer the shortest w parameter vector simply for regularization.

However, Eq. (1) is a "hard" version of the optimization problem
as it allows no constraints to be violated. In practice it is unlikely
that a solution satisfying all the constraints exists. Thus, similarly to
formulations of Support Vector Machines, we make the constraints
"soft" by introducing a loss function h that penalizes violated
constraints. The optimization problem now becomes:

$$\min_w F(w) = \|w\|^2 + \lambda \sum_{d \in D,\, l \in L} h(p_l - p_d) \tag{2}$$



where $\lambda$ is the regularization parameter that trades off the
complexity (i.e., norm of w) against the fit of the model (i.e., how much
the constraints can be violated). Moreover, $h(\cdot)$ is a loss function
that assigns a non-negative penalty according to the difference of
the scores $p_l - p_d$. If $p_l - p_d < 0$ then $h(\cdot) = 0$ as $p_l < p_d$ and
the constraint is not violated, while for $p_l - p_d > 0$, also $h(\cdot) > 0$.
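As an illustration of the soft-constraint objective, the sketch below evaluates Eq. (2) for a given score vector using a squared hinge loss. The squared hinge is only one assumed choice that satisfies the stated conditions on h; the function names and the toy scores are hypothetical.

```python
def squared_hinge(delta):
    """h(delta): zero when p_l < p_d (delta < 0), positive when delta > 0."""
    return max(0.0, delta) ** 2


def squared_hinge_grad(delta):
    """Derivative of h with respect to delta = p_l - p_d."""
    return 2.0 * max(0.0, delta)


def objective(w, p, D, L, lam):
    """F(w) = ||w||^2 + lam * sum over d in D, l in L of h(p_l - p_d)."""
    reg = sum(wk * wk for wk in w)
    loss = sum(squared_hinge(p[l] - p[d]) for d in D for l in L)
    return reg + lam * loss


# Toy scores: l2 outranks the destination d1, so one constraint is violated.
p = {"d1": 0.3, "l1": 0.1, "l2": 0.4}
value = objective([1.0, -1.0], p, D=["d1"], L=["l1", "l2"], lam=1.0)
```

Here the pair (d1, l1) contributes nothing because the constraint holds, while (d1, l2) contributes a penalty of roughly 0.01 on top of the regularization term of 2.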

Solving the optimization problem. First we need to establish the
connection between the parameters w of the edge strength function
$f_w(\psi_{uv})$ and the random walk scores p. Then we show how to
obtain the derivative of the loss function and the random walk scores
p with respect to w and then perform a gradient based optimization
method to minimize the loss and find the optimal parameters w.

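Function $f_w(\psi_{uv})$ combines the attributes $\psi_{uv}$ and the parameter
vector w to output a non-negative weight $a_{uv}$ for each edge. We
then build the random walk stochastic transition matrix $Q'$:

$$Q'_{uv} = \begin{cases} \dfrac{a_{uv}}{\sum_{k} a_{uk}} & \text{if } (u, v) \in E, \\ 0 & \text{otherwise} \end{cases} \tag{3}$$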



To obtain the final random walk transition probability matrix Q,
we also incorporate the restart probability $\alpha$, i.e., with probability
$\alpha$ the random walk jumps back to seed node s and thus "restarts":

$$Q_{uv} = (1 - \alpha) Q'_{uv} + \alpha \mathbf{1}(v = s).$$

Note that each row of Q sums to 1 and thus each entry $Q_{uv}$ defines
the conditional probability that a walk will traverse edge (u, v)
given that it is currently at node u.
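The construction of Q and its use for prediction can be sketched as follows. The exponential form of the edge strength function, the dictionary-based graph format and the toy features are assumptions made for illustration; the text only requires that the function be non-negative and differentiable. The sketch also assumes every node has at least one out-edge.

```python
import math


def edge_strength(w, psi):
    # Hypothetical f_w: exp(w . psi), which is positive and differentiable.
    return math.exp(sum(wk * xk for wk, xk in zip(w, psi)))


def build_Q(features, w, s, alpha):
    """Rows of Q: Q[u][v] = (1 - alpha) * Q'_uv + alpha * 1(v == s)."""
    a = {edge: edge_strength(w, psi) for edge, psi in features.items()}
    row_sum = {}
    for (u, _), a_uv in a.items():
        row_sum[u] = row_sum.get(u, 0.0) + a_uv
    Q = {u: {} for u in row_sum}
    for (u, v), a_uv in a.items():
        Q[u][v] = (1.0 - alpha) * a_uv / row_sum[u]
    for u in Q:
        Q[u][s] = Q[u].get(s, 0.0) + alpha
    return Q


def stationary(Q, tol=1e-12, max_iter=10000):
    """Power iteration for the stationary distribution p^T = p^T Q."""
    nodes = list(Q)
    p = {u: 1.0 / len(nodes) for u in nodes}
    for _ in range(max_iter):
        nxt = {u: 0.0 for u in nodes}
        for u in nodes:
            for v, q in Q[u].items():
                nxt[v] += p[u] * q
        diff = max(abs(nxt[u] - p[u]) for u in nodes)
        p = nxt
        if diff < tol:
            break
    return p


# Toy graph: paths s-a-c and s-b-d, one feature per edge, stored in both directions.
features = {}
for u, v, x in [("s", "a", 1.0), ("a", "c", 1.0), ("s", "b", 0.0), ("b", "d", 0.0)]:
    features[(u, v)] = [x]
    features[(v, u)] = [x]

Q = build_Q(features, w=[2.0], s="s", alpha=0.2)
p = stationary(Q)
ranking = sorted(["c", "d"], key=p.get, reverse=True)
```

With a positive weight on the single feature, the walk favors the edge from s to a, so c outranks d even though the two nodes occupy symmetric positions in the unweighted graph. Setting the weight to zero restores the tie.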

The vector p is the stationary distribution of the Random walk
with restarts (also known as Personalized PageRank), and is the
solution to the following eigenvector equation:

$$p^T = p^T Q \tag{4}$$

Equation (4) establishes the connection between the node PageRank
scores $p_u \in p$ and parameters w of function $f_w(\psi_{uv})$ via the
random walk transition matrix Q. Our goal now is to minimize
Eq. (2) with respect to the parameter vector w. We approach this by
first deriving the gradient of F(w) with respect to w, and then using a
gradient based optimization method to find w that minimizes F(w).
Note that this is non-trivial due to the recursive relation in Eq. (4).

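First, we introduce a new variable $\delta_{ld} = p_l - p_d$ and then we can
write the derivative:

$$\frac{\partial F(w)}{\partial w} = 2w + \lambda \sum_{l,d} \frac{\partial h(p_l - p_d)}{\partial w} = 2w + \lambda \sum_{l,d} \frac{\partial h(\delta_{ld})}{\partial \delta_{ld}} \left( \frac{\partial p_l}{\partial w} - \frac{\partial p_d}{\partial w} \right) \tag{5}$$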



For commonly used loss functions $h(\cdot)$ (like hinge-loss or squared
loss), it is simple to compute the derivative $\partial h(\delta_{ld}) / \partial \delta_{ld}$. However, it is
not clear how to compute $\partial p_u / \partial w$, the derivative of the score $p_u$ with
respect to the vector w. Next we show how to do this.

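Note that p is the principal eigenvector of matrix Q. Eq. (4) can be
rewritten as $p_u = \sum_j p_j Q_{ju}$ and taking the derivative now gives:

$$\frac{\partial p_u}{\partial w} = \sum_j \left( Q_{ju} \frac{\partial p_j}{\partial w} + p_j \frac{\partial Q_{ju}}{\partial w} \right) \tag{6}$$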



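Notice that $p_u$ and $\frac{\partial p_u}{\partial w}$ are recursively entangled in the equation.
However, we can still compute the gradient iteratively [4, 3]. By
recursively applying the chain rule to Eq. (6) we can use a
power-method-like algorithm to compute the derivative. We repeatedly
compute the derivative $\frac{\partial p_u}{\partial w}$ based on the estimate obtained in the
previous iteration. Thus, we first compute p and then update the
estimate of the gradient $\frac{\partial p_u}{\partial w}$. We stop the algorithm when both p
and $\frac{\partial p}{\partial w}$ do not change (i.e., $\epsilon = 10^{-12}$ in our experiments) between
iterations. We arrive at Algorithm 1 that iteratively computes the
eigenvector p as well as the partial derivatives of p. Convergence
of Algorithm 1 is similar to that of power iteration [5].

    Initialize PageRank scores p and partial derivatives $\frac{\partial p_u}{\partial w_k}$:
    foreach $u \in V$ do $p_u^{(0)} = \frac{1}{|V|}$
    foreach $u \in V$, $k = 1, \ldots, |w|$ do $\frac{\partial p_u}{\partial w_k}^{(0)} = 0$
    t = 1
    while not converged do
        foreach $u \in V$ do
            $p_u^{(t)} = \sum_j p_j^{(t-1)} Q_{ju}$
        t = t + 1
    t = 1
    foreach $k = 1, \ldots, |w|$ do
        while not converged do
            foreach $u \in V$ do
                $\frac{\partial p_u}{\partial w_k}^{(t)} = \sum_j \left( Q_{ju} \frac{\partial p_j}{\partial w_k}^{(t-1)} + p_j \frac{\partial Q_{ju}}{\partial w_k} \right)$
            t = t + 1
    return $\frac{\partial p_u}{\partial w}$

Algorithm 1: Iterative power-iteration-like computation of
PageRank vector p and its derivative $\frac{\partial p_u}{\partial w}$.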

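To evaluate Eq. (6) we further need to compute $\frac{\partial Q_{ju}}{\partial w}$, which is the
partial derivative of entry $Q_{ju}$ (Eq. (3)). This calculation is
straightforward. When $(j, u) \in E$ we find

$$\frac{\partial Q_{ju}}{\partial w} = (1 - \alpha) \frac{\frac{\partial f_w(\psi_{ju})}{\partial w} \left( \sum_k f_w(\psi_{jk}) \right) - f_w(\psi_{ju}) \left( \sum_k \frac{\partial f_w(\psi_{jk})}{\partial w} \right)}{\left( \sum_k f_w(\psi_{jk}) \right)^2}$$

and otherwise $\frac{\partial Q_{ju}}{\partial w} = 0$. The edge strength function $f_w(\psi_{uv})$
must be differentiable and so $\frac{\partial f_w}{\partial w}(\psi_{jk})$ can be easily computed.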

This completes the derivation and shows how to evaluate the
derivative of F(w) (Eq. (5)). Now we apply a gradient descent based
method, like a quasi-Newton method, and directly minimize F(w).
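The derivation can be checked numerically with a small sketch of the iterative computation of p and its derivatives. It assumes an exponential edge strength function, a dense list-of-lists matrix representation and a four-node toy graph in which every node has an out-edge; these choices and all function names are illustrative and not taken from the paper.

```python
import math


def strengths(edges, w):
    """Hypothetical f_w(psi) = exp(w . psi) for each edge (u, v, psi)."""
    return [math.exp(sum(wk * x for wk, x in zip(w, psi))) for _, _, psi in edges]


def build_Q_and_dQ(n, edges, w, s, alpha):
    """Q (Eq. 3 plus restart) and dQ[k] = dQ/dw_k via the quotient rule."""
    a = strengths(edges, w)
    K = len(w)
    tot = [0.0] * n                       # sum_k f_w(psi_jk) for each source j
    dtot = [[0.0] * K for _ in range(n)]  # derivative of that sum w.r.t. w_k
    for (u, _, psi), a_uv in zip(edges, a):
        tot[u] += a_uv
        for k in range(K):
            dtot[u][k] += a_uv * psi[k]   # d exp(w.psi)/dw_k = exp(w.psi) * psi_k
    Q = [[alpha if v == s else 0.0 for v in range(n)] for _ in range(n)]
    dQ = [[[0.0] * n for _ in range(n)] for _ in range(K)]
    for (u, v, psi), a_uv in zip(edges, a):
        Q[u][v] += (1.0 - alpha) * a_uv / tot[u]
        for k in range(K):
            num = a_uv * psi[k] * tot[u] - a_uv * dtot[u][k]
            dQ[k][u][v] += (1.0 - alpha) * num / tot[u] ** 2
    return Q, dQ


def pagerank_and_gradient(Q, dQ, tol=1e-12, max_iter=100000):
    """Power iteration for p, then the same style of iteration for each dp/dw_k."""
    n = len(Q)
    p = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(p[j] * Q[j][u] for j in range(n)) for u in range(n)]
        done = max(abs(x - y) for x, y in zip(nxt, p)) < tol
        p = nxt
        if done:
            break
    grads = []
    for dQk in dQ:
        dp = [0.0] * n
        for _ in range(max_iter):
            nxt = [sum(Q[j][u] * dp[j] + p[j] * dQk[j][u] for j in range(n))
                   for u in range(n)]
            done = max(abs(x - y) for x, y in zip(nxt, dp)) < tol
            dp = nxt
            if done:
                break
        grads.append(dp)
    return p, grads  # grads[k][u] approximates dp_u / dw_k


# Toy directed graph on nodes 0..3 with two features per edge; node 0 is s.
edges = [
    (0, 1, [1.0, 0.0]), (0, 2, [0.0, 1.0]),
    (1, 0, [1.0, 0.0]), (1, 3, [0.5, 0.5]),
    (2, 0, [0.0, 1.0]), (2, 3, [1.0, 1.0]),
    (3, 1, [0.5, 0.5]), (3, 2, [1.0, 1.0]),
]
w = [0.3, -0.2]
Q, dQ = build_Q_and_dQ(4, edges, w, s=0, alpha=0.2)
p, grads = pagerank_and_gradient(Q, dQ)
```

One way to gain confidence in such an implementation is to compare `grads` against central finite differences of p in each coordinate of w. Because the entries of p always sum to one, the derivative entries for each parameter should sum to zero.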

Final remarks. First we note that our problem is not convex in 
general, and thus gradient descent methods will not necessarily find 
the global minimum. In practice we resolve this by using several 
different starting points to find a good solution. 

Second, since we are only interested in the values of p for nodes
in C, it makes sense to evaluate the loss function at a slightly
different point: $h(p'_l - p'_d)$ where $p'$ is a normalized version of p such
that $p'_u = p_u / \sum_{v \in C} p_v$. This adds one more chain rule application to
the derivative calculation, but does not change the algorithm. The
effect of this is mostly to allow larger values of $\alpha$ to be used
without having to change $h(\cdot)$. (We omit the tick marks in our notation
for the rest of this paper, using p to refer to the normalized score.)

So far we only considered training and estimating the parameter
vector w for predicting the edges of a particular node s. However,
our aim is to estimate w that makes good predictions across many
different nodes $s \in S$. We easily extend the algorithm to multiple
source nodes $s \in S$, that may even reside in different graphs. We
do this by taking the sum of losses over all source nodes s and the
corresponding pairs of positive $D_s$ and negative $L_s$ training
examples. We slightly modify Eq. (2) to obtain:
$$\min_w F(w) = \|w\|^2 + \lambda \sum_{s \in S} \sum_{d \in D_s,\, l \in L_s} h(p_l - p_d)$$



The gradients of each instance $s \in S$ remain independent, and can
thus be computed independently for all instances of s (Alg. 1). By
optimizing parameters w over many individuals s, the algorithm is
less likely to overfit, which improves the generalization.

As a final implementation note, we point out that gradient descent
often makes many small steps which have small impact on
the eigenvector and its derivative. A 20% speedup can be achieved
by using the solutions from the previous position (in the gradient
descent) as initialization for the eigenvector and derivative
calculations in Alg. 1. Our implementation of Supervised Random Walks
uses the L-BFGS algorithm [22]. Given a function and its partial
derivatives, the solver iteratively improves the estimate of w,
converging to a local optimum. The exact runtime of the method
depends on how many iterations are required for convergence of both
the PageRank and derivative computations, as well as of the overall
process (quasi-Newton iterations).



3. EXPERIMENTS ON SYNTHETIC DATA 

Before experimenting with real data, we examine the soundness 
and robustness of the proposed algorithm using synthetic data. Our 
goal here is to generate synthetic graphs, edge features and training 
data (triples (s, D, L)) and then try to recover the original model. 

Synthetic data. We generate scale-free graphs G on 10,000 nodes
by using the Copying model [18]: The graph starts with three nodes
connected in a triad. Remaining nodes arrive one by one, each
creating exactly three edges. When a node u arrives, it adds three
edges $(u, v_i)$. Existing node $v_i$ is selected uniformly at random
with probability 0.8, and otherwise $v_i$ is selected with probability
proportional to its current degree. For each edge (u, v) we create
two independent Gaussian features with mean 0 and variance 1. We
set the edge strength $a_{uv} = \exp(\psi_{uv1} - \psi_{uv2})$, i.e., $w^* = [1, -1]$.
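The generator described above can be sketched as follows. Requiring the three targets of each arriving node to be distinct, storing each undirected edge once as a pair (u, v), and the function names are assumptions of this sketch rather than details given in the text.

```python
import math
import random


def copying_model_graph(n, rng, edges_per_node=3, p_uniform=0.8):
    """Sketch of the generator described above; returns a list of (u, v) edges."""
    edges = [(0, 1), (1, 2), (0, 2)]   # initial triad
    endpoints = [0, 1, 1, 2, 0, 2]     # each node appears once per incident edge
    for u in range(3, n):
        targets = set()
        while len(targets) < edges_per_node:
            if rng.random() < p_uniform:
                v = rng.randrange(u)        # uniform over existing nodes
            else:
                v = rng.choice(endpoints)   # proportional to current degree
            targets.add(v)                  # assumption: targets are distinct
        for v in sorted(targets):
            edges.append((u, v))
            endpoints.extend((u, v))
    return edges


def attach_features(edges, rng, w_true=(1.0, -1.0)):
    """Two standard Gaussian features per edge and the strength exp(w* . psi)."""
    feats, strength = {}, {}
    for e in edges:
        psi = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
        feats[e] = psi
        strength[e] = math.exp(w_true[0] * psi[0] + w_true[1] * psi[1])
    return feats, strength


rng = random.Random(0)
edges = copying_model_graph(200, rng)
feats, strength = attach_features(edges, rng)
```

A graph on 200 nodes is used here only to keep the example fast; the experiments in the text use 10,000 nodes.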

For each G, we randomly select one of the oldest 3 nodes of G 
as the start node, s. To generate a set of destination D and no-link 
nodes L for a given s we use the following approach. 

On the graph with edge strengths $a_{uv}$ we run the random walk
($\alpha = 0.2$) starting from s and obtain node PageRank scores $p^*$. We
use these scores to generate the destinations D in one of two ways.
The first is deterministic and selects the top K nodes according to $p^*$
to which s is not already connected. The second is probabilistic and
selects K nodes, selecting each node u with probability $p^*_u$.
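The two ways of generating D might be sketched as below. Excluding existing neighbors of s in the probabilistic variant and sampling without replacement are assumptions of this sketch, since the text does not spell out those details; the function names and toy scores are hypothetical.

```python
import random


def top_k_destinations(p, s, neighbors_of_s, K):
    """Deterministic variant: the K highest-scoring nodes not yet linked to s."""
    candidates = [u for u in p if u != s and u not in neighbors_of_s]
    return sorted(candidates, key=lambda u: p[u], reverse=True)[:K]


def sampled_destinations(p, s, neighbors_of_s, K, rng):
    """Probabilistic variant: K distinct nodes, each draw proportional to p_u."""
    candidates = [u for u in p if u != s and u not in neighbors_of_s]
    chosen = []
    while candidates and len(chosen) < K:
        weights = [p[c] for c in candidates]
        u = rng.choices(candidates, weights=weights)[0]
        chosen.append(u)
        candidates.remove(u)   # assumption: sample without replacement
    return chosen


# Toy scores for a source s that is already connected to a.
p_star = {"s": 0.4, "a": 0.25, "b": 0.2, "c": 0.1, "d": 0.05}
D_det = top_k_destinations(p_star, "s", {"a"}, K=2)
D_prob = sampled_destinations(p_star, "s", {"a"}, K=2, rng=random.Random(1))
```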

Now given the graph G, attributes $\psi_{uv}$ and targets D our goal is
to recover the true edge strength parameter vector $w^* = [1, -1]$.
To make the task more interesting we also add random noise to all
of the attributes, so that $\psi'_{uvi} = \psi_{uvi} + N(0, \sigma^2)$, where $N(0, \sigma^2)$
is a Gaussian random variable with mean 0 and variance $\sigma^2$.

Results. After applying our algorithm, we are interested in two
things. First, how well does the model perform in terms of the
classification accuracy and second, whether it recovers the edge
strength function parameters $w^* = [1, -1]$. In the deterministic
case of creating D and with 0 noise added, we hope that the
algorithm is able to achieve near perfect classification. As the noise
increases, we expect the performance to drop, but even then, we
hope that the recovered values of w will be close to the true $w^*$.
Figure 1: Experiments on synthetic data. Deterministic D.

Figure 2: Experiments on synthetic data. Probabilistic D.

In running the experiment we generated 100 synthetic graphs. 
We used 50 of them for training the weights w, and report results 
on the other 50. We compute Area under the ROC curve (AUC) 
of each of 50 test graphs, and report the mean (AUC of 1.0 means 
perfect classification, while random guessing scores 0.5). 
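For reference, the AUC of a single test graph can be computed directly as the fraction of (destination, no-link) pairs that the scores order correctly. The sketch below uses this pairwise definition, which is quadratic in the number of candidates and intended only for illustration; the function name and toy scores are hypothetical.

```python
def auc(scores, positives, negatives):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    correct = 0.0
    for d in positives:
        for l in negatives:
            if scores[d] > scores[l]:
                correct += 1.0
            elif scores[d] == scores[l]:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy scores for four candidate nodes.
scores = {"a": 0.9, "b": 0.8, "c": 0.3, "d": 0.1}
perfect = auc(scores, ["a", "b"], ["c", "d"])
mixed = auc(scores, ["a", "d"], ["b", "c"])

# Mean AUC over several test graphs is a plain average of per-graph values.
mean_auc = sum([perfect, mixed]) / 2
```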

Figures 1 and 2 show the results. We plot the performance of the
model that ignores edge weights (red), the model with true weights
$w^*$ (green) and a model with learned weights w (blue).

For the deterministically generated D (Fig. 1), the performance
is perfect in the absence of any noise. This is good news as it
demonstrates that our training procedure is able to recover the
correct parameters. As the noise increases, the performance slowly
drops. When the noise reaches $\sigma^2 \approx 1.5$, using the true
parameters $w^*$ (green) actually becomes worse than simply ignoring them
(red). Moreover, our algorithm learns the true parameters [+1, -1]
almost perfectly in the noise-free case, and decreases their
magnitude as the noise level increases. This matches the intuition that,
as more and more noise is added, the signal in the edge attributes
becomes weaker and weaker relative to the signal in the graph
structure. Thus, with more noise, the parameter values w decrease
as they are given less and less credence.

In the probabilistic case (Fig. 2), we see that our algorithm does
better (statistically significant at p = 0.01) than the model with
true parameters $w^*$, regardless of the presence or absence of noise.
Even though the data was generated using parameters $w^* = [+1, -1]$,
these values are not optimal and our model gets better AUC by finding
different (smaller) values. Again, as we add noise, the overall
performance slowly drops, but still does much better than the baseline
method of ignoring edge strengths (red), and continues to do
better than the model that uses true parameter values $w^*$ (green).

We also note that regardless of where we initialize the parameter
vector w before starting gradient descent, it always converges to the
same solution. Having thus validated our algorithm on synthetic
data, we now move on to predicting links in real social networks.

Figure 3: Probability of a new link as a function of the number
of mutual friends.

Figure 4: Facebook Iceland: Hop distance between a pair of
nodes just before they become friends. Distance x=-1 denotes
nodes that were in separate components, while x=2 (friends of
friends) is an order of magnitude higher than the next highest point.

4. EXPERIMENTAL SETUP 

For experiments on real data we consider four real physics co- 
authorship networks and a complete Facebook network of Iceland. 

Generally we focus on predicting links to nodes that are 2 hops
from the seed node s. We do this for two reasons. First, in online
social networks more than half of all edges at the time of creation
close a triangle, i.e., a person connects to a friend of a friend [19].
For instance, Figure 4 shows that 92% of all edges created on
Facebook Iceland close a path of length two, i.e., a triangle. Second,
this also makes the Supervised Random Walks run faster as graphs get
smaller. Given that some Facebook users have degrees in the thousands,
it is not practical to incorporate nodes farther away (a user may
have as many as a hundred million nodes at 3 hops).
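The restriction to 2-hop candidates can be sketched as a set computation over an adjacency map. This is only an illustration under an assumed data layout (a dict from node to a set of neighbors); the name `two_hop_candidates` is hypothetical.

```python
def two_hop_candidates(adj, s):
    """Nodes exactly two hops from s: friends of friends that are not
    already neighbors of s (and are not s itself).

    adj: dict mapping each node to the set of its neighbors (undirected).
    """
    friends = adj[s]
    candidates = set()
    for u in friends:
        candidates |= adj[u]  # everything reachable through friend u
    return candidates - friends - {s}
```

Linking s to any node returned here closes at least one path of length two, which is the triangle-closing case discussed above.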

Co-authorship networks. First we consider the co-authorship networks
from the arXiv e-print archive [12], where we have a time-stamped
list of all papers with author names and titles submitted to arXiv
between 1992 and 2002. We consider co-authorship networks from
four different areas of physics: Astro-physics (Astro-Ph), Condensed
Matter (Cond-Mat), High energy physics theory (Hep-Th)
and High energy physics phenomenology (Hep-Ph). For each of
the networks we proceed as follows. For every node u we compute
the total number of co-authors at the end of the dataset (i.e., network
degree) $k_u$ and let $t_u$ be the time when u created its $k_u/2$-th
edge. Then we define $m_u$ to be the number of co-authorship links
that u created after time $t_u$ and that at the time of creation spanned
2 hops (i.e., closed a triangle). We attempt to make predictions
only for "active" authors, where we define a node u to be active if
$k_u > K$ and $m_u > \Delta$. In this work, we set K = 10 and $\Delta$ = 5.
For every source node s that is above this threshold, we extract the
network at time $t_s$ and try to predict the $d_s$ new edges that s creates
in the time after $t_s$. Table 1 gives dataset statistics.

Dataset     N        E        S      D     C      D/C
Astro-Ph    19,144   198,110  1,123  18.0  775.6  0.023
Cond-Mat    23,608   94,492   140    9.1   335.5  0.027
Hep-Ph      12,527   118,515  340    29.2  345.3  0.084
Hep-Th      10,700   25,997   55     6.3   110.5  0.057
Facebook    174,000  29M      200    43.6  1987   0.022

Table 1: Dataset statistics. N, E: number of nodes and edges
in the full network, S: number of sources, C: avg. number of
candidates per source, D: avg. number of destination nodes.

For every edge (i, j) of the network around the source node u at
time $t_u$ we generate the following six features:

• Number of papers i wrote before $t_u$

• Number of papers j wrote before $t_u$

• Number of papers i and j co-authored

• Cosine similarity between the titles of papers written by i and
the titles of j's papers

• Time since i and j last co-authored a paper

• The number of common friends between j and s
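The title-similarity feature in the list above could be computed roughly as follows. The text does not specify how titles are turned into vectors, so this sketch assumes a plain lowercase bag-of-words count over all of an author's titles; `title_cosine` is a hypothetical helper name.

```python
import math
from collections import Counter


def title_cosine(titles_i, titles_j):
    """Cosine similarity between word-count vectors of two authors' titles.

    Assumption of this sketch: whitespace tokenization, lowercasing,
    and no stop-word removal or term weighting.
    """
    words_i = Counter(w for t in titles_i for w in t.lower().split())
    words_j = Counter(w for t in titles_j for w in t.lower().split())
    dot = sum(words_i[w] * words_j[w] for w in words_i.keys() & words_j.keys())
    norm_i = math.sqrt(sum(c * c for c in words_i.values()))
    norm_j = math.sqrt(sum(c * c for c in words_j.values()))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0  # an author with no title words shares nothing
    return dot / (norm_i * norm_j)
```

Authors whose titles share no words get similarity 0, and identical title sets get similarity 1.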

The Facebook network. Our second set of data comes from the 
Facebook online social network. We first selected Iceland since it 
has high Facebook penetration, but relatively few edges pointing 
to users in other countries. We generated our data based on the 
state of the Facebook graph on November 1, 2009. The destination 
nodes D from a node s are those that s became friends with be- 
tween November 1 2009 and January 13 2010. The Iceland graph 
contains more than 174 thousand people, or 55% of the country's 
population. The average user had 168 friends, and during the pe- 
riod Nov 1 - Jan 23, an average person added 26 new friends. 

From these users, we randomly selected 200 as the nodes s.
Again, we only selected "active" nodes, this time with the criterion
$|D| > 20$. As Figure 3 shows, individuals without many mutual
friends are exceedingly unlikely to become friends. As the Facebook
graph contains users whose 2-hop neighborhoods have several
million nodes, we can prune such graphs and speed up the computations
without losing much prediction performance. Since we
know that individuals with only a few mutual friends are unlikely to
form friendships, and our goal is to predict the most likely friendships,
we remove all individuals with fewer than 4 mutual friends
with practically no loss in performance. As demonstrated in Figure 3,
if a user creates an edge, then the probability that she links
to a node with whom she has fewer than 4 mutual friends is about 0.1%.

We annotated each edge of the Facebook network with seven
features. For each edge we created:

• Edge age: $(T - t)^{-\beta}$, where T is the time cutoff Nov. 1, and
t is the edge creation time. We create three features like this
with $\beta = \{0.1, 0.3, 0.5\}$.

• Edge initiator: Individual making the friend request is encoded
as +1 or -1.

• Communication and observation features. They represent the
probability of communication and profile observation in a
one week period.

• The number of common friends between j and s.

All features in all datasets are re-scaled to have mean 0 and standard
deviation 1. We also add a constant feature with value 1.
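The re-scaling step can be sketched as a column-wise z-score followed by appending the constant feature. The function name and the handling of zero-variance columns are assumptions of this example, not details given in the text.

```python
import statistics


def standardize_features(rows):
    """Rescale each feature column to mean 0 and standard deviation 1,
    then append a constant feature equal to 1.

    rows: list of equal-length feature lists, one per edge.
    Assumption of this sketch: a column with zero variance is mapped to 0.
    """
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    scaled = []
    for row in rows:
        z = [(x - m) / sd if sd > 0 else 0.0
             for x, m, sd in zip(row, means, stds)]
        scaled.append(z + [1.0])  # constant feature
    return scaled
```

With seven raw features per Facebook edge, the output rows have eight entries, one per learned weight.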

Evaluation methodology. For each dataset, we assign half of the
nodes s to the training set and half to the test set. We use the
training set to train the algorithm (i.e., estimate w). We evaluate the
method on the test set, considering two performance metrics: the Area
under the ROC curve (AUC) and the Precision at Top 20 (Prec@20), i.e.,
how many of the top 20 nodes suggested by our algorithm actually
receive links from s. This measure is particularly appropriate in
the context of link recommendation, where we present a user with a
set of friendship suggestions and aim for most of them to be correct.
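A minimal sketch of the Prec@20 measure as described here, reported as a count of hits among the top 20 rather than as a fraction. The name `hits_at_top_k` and the dict-based input are assumptions for illustration; ties are broken by insertion order, which is an arbitrary choice.

```python
def hits_at_top_k(scores, destinations, k=20):
    """Number of the k highest-scored candidates that are true destinations.

    scores: dict mapping candidate node -> predicted score.
    destinations: set of nodes that s actually linked to.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return sum(1 for c in ranked[:k] if c in destinations)
```

Dividing the returned count by k gives the corresponding fraction of correct suggestions.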

5. EXPERIMENTS ON REAL DATA 

Next we describe the results on five real datasets: four
co-authorship networks and the Facebook network of Iceland.

5.1 General considerations 

First we evaluate several aspects of our algorithm: (A) the choice
of the loss function, (B) the choice of the edge strength function
$f_w(\cdot)$, (C) the choice of random walk restart (jump) parameter $\alpha$,
and (D) the choice of regularization parameter $\lambda$. We also consider the
extension where we learn a separate edge weight vector depending
on the type of the edge, i.e., whether an edge touches s or any of
the candidate nodes $c \in C$.

(A) Choice of the loss function. As is the case with most machine
learning algorithms, the choice of loss function plays an important
role. Ideally we would like to optimize the loss function $h(\cdot)$ which
directly corresponds to our evaluation metric (i.e., AUC or Precision
at top k). However, such loss functions are neither continuous
nor differentiable, so it is not clear how to optimize over
them. Instead, we experiment with three common loss functions:

• Squared loss with margin b:

$$h(x) = \max\{x + b, 0\}^2$$

• Huber loss with margin b and window z > b:

$$h(x) = \begin{cases} 0 & \text{if } x \le -b, \\ (x + b)^2/(2z) & \text{if } -b < x \le z - b, \\ (x + b) - z/2 & \text{if } x > z - b \end{cases} \qquad (7)$$

• Wilcoxon-Mann-Whitney (WMW) loss with width b (proposed
for use when one aims to maximize AUC [32]):

$$h(x) = \frac{1}{1 + \exp(-x/b)}$$
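The three loss functions listed above can be written out directly. The sketch below is illustrative only: the function names are chosen for this example, x stands for the score difference between a non-destination node and a destination node, and the logistic is evaluated in a numerically stable form, which is an implementation choice not discussed in the text.

```python
import math


def squared_loss(x, b):
    """Squared loss with margin b: max(x + b, 0)^2."""
    return max(x + b, 0.0) ** 2


def huber_loss(x, b, z):
    """Huber loss with margin b and window z: zero, then quadratic, then linear."""
    if x <= -b:
        return 0.0
    if x <= z - b:
        return (x + b) ** 2 / (2.0 * z)
    return (x + b) - z / 2.0


def wmw_loss(x, b):
    """Wilcoxon-Mann-Whitney loss with width b: 1 / (1 + exp(-x / b))."""
    t = x / b
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)  # t < 0, so this cannot overflow
    return e / (1.0 + e)
```

The quadratic and linear pieces of the Huber loss agree at x = z - b, where both equal z/2, so the function is continuous there.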



Each of these loss functions is differentiable and needs to be
evaluated for all pairs of nodes $d \in D$ and $l \in L$ (see Eq. 2).
Performing this naively takes approximately $O(c^2)$ where $c = |D \cup L|$.
However, we next show that the first two loss functions have the
advantage that they can be computed in $O(c \log c)$. For example, we
rewrite the squared loss as:

$$\begin{aligned}
\sum_{d,l} h(p_l - p_d) &= \sum_{l,d:\, p_l + b > p_d} (p_l - p_d + b)^2 \\
&= \sum_{l} \sum_{d:\, p_l + b > p_d} \left[ (p_l + b)^2 - 2(p_l + b)\,p_d + p_d^2 \right] \\
&= \sum_{l} \left[ |\{d : p_l + b > p_d\}|\,(p_l + b)^2 - 2(p_l + b) \sum_{d:\, p_l + b > p_d} p_d + \sum_{d:\, p_l + b > p_d} p_d^2 \right]
\end{aligned}$$

Figure 5: Impact of random walk restart parameter $\alpha$ (effect of the $\alpha$ value on Hep-Ph performance).

Once we have the lists $\{p_l\}$ and $\{p_d\}$ sorted, we can iterate over
the list $\{p_l\}$ in reverse order. As we do this, we can incrementally
update the two terms which sum over d above. The Huber loss can
likewise be evaluated quickly using a similar calculation.
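The following sketch shows one way to realize the $O(c \log c)$ evaluation of the squared loss. It is a variant of the idea in the text rather than the exact procedure described: instead of walking two sorted lists in reverse, it sorts the destination scores once, stores prefix sums of $p_d$ and $p_d^2$, and uses binary search for each $p_l$. Function names are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate


def squared_loss_naive(p_d, p_l, b):
    """Reference O(|D| * |L|) evaluation of sum over pairs of max(p_l - p_d + b, 0)^2."""
    return sum(max(pl - pd + b, 0.0) ** 2 for pl in p_l for pd in p_d)


def squared_loss_fast(p_d, p_l, b):
    """Same sum using sorting, prefix sums, and binary search."""
    d_sorted = sorted(p_d)
    prefix = [0.0] + list(accumulate(d_sorted))                     # sums of p_d
    prefix_sq = [0.0] + list(accumulate(x * x for x in d_sorted))   # sums of p_d^2
    total = 0.0
    for pl in p_l:
        t = pl + b
        k = bisect_left(d_sorted, t)  # number of d with p_d < p_l + b
        # k*(p_l+b)^2 - 2(p_l+b)*sum(p_d) + sum(p_d^2) over those k nodes
        total += k * t * t - 2.0 * t * prefix[k] + prefix_sq[k]
    return total
```

Both functions return the same value up to floating-point rounding; the fast version costs one sort plus one binary search per non-destination node.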

Computation of the WMW loss is more expensive, as there is no
way to avoid the summation over all pairs. Evaluating the WMW
loss thus takes time $O(|D| \cdot |L|)$. In our case, $|D|$ is typically
relatively small, and so the computation is not a significant part
of the total runtime. The primary advantage of the WMW loss is that
it performs slightly better. Indeed, in the limit as b goes to 0, it reflects
AUC, as it measures the number of inversions in the ordering [32].
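The limiting behavior can be checked numerically on a toy example. In this sketch (with hypothetical function names and made-up scores), the mean WMW loss over all destination/non-destination pairs is compared with the fraction of inverted pairs, which equals 1 - AUC when there are no ties.

```python
import math


def mean_wmw_loss(p_d, p_l, b):
    """Average of 1 / (1 + exp(-(p_l - p_d) / b)) over all (d, l) pairs."""
    total = 0.0
    for pd in p_d:
        for pl in p_l:
            t = (pl - pd) / b
            if t >= 0:
                total += 1.0 / (1.0 + math.exp(-t))
            else:
                e = math.exp(t)  # stable form for negative t
                total += e / (1.0 + e)
    return total / (len(p_d) * len(p_l))


def inversion_fraction(p_d, p_l):
    """Fraction of pairs where a non-destination outranks a destination."""
    bad = sum(1 for pd in p_d for pl in p_l if pl > pd)
    return bad / (len(p_d) * len(p_l))
```

For small b each pair contributes almost exactly 0 or 1, so the smooth loss approaches the inversion count; larger b trades that fidelity for smoother gradients.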

In our experiments we notice that while the gradient descent 
achieves significant reduction in the value of the loss for all three 
loss functions, this only translates to improved AUC and Prec@20 
for the WMW loss. In fact, the model trained with the squared or 
the Huber loss does not perform much better than the baseline we 
obtain through unweighted PageRank. Consequently, we use the 
WMW loss function for the remainder of this work. 

(B) Choice of edge strength function $f_w(\psi_{uv})$. The edge strength
function must be non-negative and differentiable. While
more complex functions are certainly possible, we experiment with
two functions. In both cases, we start by taking the inner product of
the weight vector w and the feature vector $\psi_{uv}$ of an edge (u, v).
This yields a single scalar value, which may be negative. To transform
this into the desired domain, we apply either an exponential
or logistic function:

Exponential edge strength: $a_{uv} = \exp(\psi_{uv} \cdot w)$

Logistic edge strength: $a_{uv} = (1 + \exp(-\psi_{uv} \cdot w))^{-1}$

Our experiments show that the choice of the edge strength function
does not seem to make a significant impact on performance.
There is slight evidence from our experiments that the logistic function
performs better. One problem that can occur with the exponential
version is underflow and overflow of double precision floating
point numbers. As the performance seems quite comparable, we
recommend the use of the logistic function to avoid this potential pitfall.
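To make the overflow concern concrete, here is a small sketch of both edge strength functions. The helper names are hypothetical, and the stable evaluation of the logistic is an implementation choice of this example.

```python
import math


def inner_product(psi, w):
    """Dot product of an edge feature vector psi and the weight vector w."""
    return sum(p * wi for p, wi in zip(psi, w))


def exponential_strength(psi, w):
    """a_uv = exp(psi . w); math.exp raises OverflowError for large inputs."""
    return math.exp(inner_product(psi, w))


def logistic_strength(psi, w):
    """a_uv = 1 / (1 + exp(-psi . w)), evaluated without overflow."""
    t = inner_product(psi, w)
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)
```

With w = 0 both functions assign the same strength to every edge (1 and 0.5 respectively), so the walk reduces to the unweighted case. The logistic output stays in [0, 1] for any input, whereas the exponential version fails outright once the inner product exceeds roughly 709 in double precision.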

(C) Choice of $\alpha$. To get a handle on the impact of the random walk
restart parameter $\alpha$, it is useful to think of the extreme cases for
unweighted graphs. When $\alpha = 0$, the PageRank of a node in an
undirected graph is simply proportional to its degree. On the other hand,
when $\alpha$ approaches 1, the score will be exactly proportional to the
"Random-Random" model [19], which simply makes two random hops from
s, as random walks of length greater than 2 become increasingly
unlikely, and hence the normalized eigenvector scores become the
same as the Random-Random scores [19]. When we add the notion
of edge strengths, these properties remain. Intuitively, $\alpha$ controls
how "far" the walk wanders from the seed node s before it restarts
and jumps back to s. High values of $\alpha$ give very short and local
random walks, while low values allow the walk to go farther away.

When evaluating on real data we observe that $\alpha$ plays an important
role in the simple unweighted case when we ignore the edge
strengths, but as we give the algorithm more power to assign different
strengths to edges, the role of $\alpha$ diminishes, and we see no
significant difference in performance for a broad range of choices
of $\alpha$. Figure 5 illustrates this; in the unweighted case (i.e., ignoring
edge strengths) $\alpha = 0.3$ performs best, while in the weighted case
a broad range from 0.3 to 0.7 seems to do about equally well.
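The role of the restart parameter is easy to see in a small power-iteration sketch. This is a generic random walk with restart over given edge strengths, not the paper's training code; the function name, the dict-of-dicts graph layout, and the treatment of nodes without out-edges are all assumptions of the example.

```python
def random_walk_with_restart(adj, s, alpha, max_iter=1000, tol=1e-12):
    """Stationary distribution of a walk that jumps back to s with probability alpha.

    adj: dict node -> dict neighbor -> non-negative edge strength; every
         neighbor must itself be a key of adj.
    Assumption of this sketch: a node with no outgoing strength sends its
    mass back to s.
    """
    nodes = list(adj)
    p = {u: 0.0 for u in nodes}
    p[s] = 1.0
    for _ in range(max_iter):
        nxt = {u: 0.0 for u in nodes}
        nxt[s] += alpha  # restart mass
        for u in nodes:
            total = sum(adj[u].values())
            if total == 0.0:
                nxt[s] += (1.0 - alpha) * p[u]
                continue
            for v, strength in adj[u].items():
                nxt[v] += (1.0 - alpha) * p[u] * strength / total
        change = sum(abs(nxt[u] - p[u]) for u in nodes)
        p = nxt
        if change < tol:
            break
    return p
```

On the path s - a - b with unit strengths and alpha = 0.5, the scores are 7/12, 4/12 and 1/12; raising alpha concentrates more of the mass on s and its immediate neighborhood.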

(D) Regularization parameter $\lambda$. Empirically we find that overfitting
is not an issue in our model as the number of parameters w
is relatively small. Setting $\lambda = 1$ gives the best performance.

Extension: Edge types. The Supervised Random Walks framework
we have presented so far captures the idea that some edges
are stronger than others. However, it doesn't allow for different
types of edges. For instance, it might be that an edge (u, v) between
s's friends u and v should be treated differently than the
edge (s, u) between s and u. Our model can easily capture this
idea by declaring different edges to be of different types, and learning
a different set of feature weights w for each edge type. We can
take the same approach to learning each of these weights, computing
partial derivatives with respect to each weight. The price
for this is potential overfitting and slower runtime.

In our experiments, we find that dividing the edges up into multiple
types provides significant benefit. Given a seed node s we label
the edges according to the hop-distance from s of their endpoints,
e.g., edges (s, u) are of type (0,1), edges (u, v) are either of type
(1,1) (if both u and v link to s) or (1,2) (if v does not link to s).
Since the nodes are at distance 0, 1, or 2 from s, there are 6 possible
edge types: (0,1), (1,0), (1,1), (1,2), (2,1) and (2,2). While
learning six sets of parameters w increases the runtime, using
multiple edge types gives a significant increase in performance.
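The edge typing can be sketched with a breadth-first search limited to two hops. The function names and the adjacency-set layout are assumptions of this example; each undirected edge is labeled once per direction, which is what yields both (0,1) and (1,0).

```python
from collections import deque


def hop_distances(adj, s, max_hops=2):
    """Breadth-first hop distance from s for nodes within max_hops."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if dist[u] == max_hops:
            continue  # do not expand beyond the 2-hop neighborhood
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def edge_types(adj, s):
    """Map each directed edge (u, v) inside the 2-hop neighborhood of s
    to the pair (distance of u from s, distance of v from s)."""
    dist = hop_distances(adj, s)
    types = {}
    for u in dist:
        for v in adj[u]:
            if v in dist:
                types[(u, v)] = (dist[u], dist[v])
    return types
```

A separate weight vector would then be looked up by the returned type when computing the strength of each edge.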

Extension: Social capital. Before moving on to the experimental
results, we also briefly examine a somewhat counterintuitive behavior
of the Random Walk with Restarts. Consider the graph in Figure 6
with the seed node s. There are two nodes to which s could form a
new connection: $v_1$ and $v_2$. These two are symmetric except for
the fact that the two paths connecting s to $v_1$ are connected
themselves. Now we ask, is s more likely to link to $v_1$ or to $v_2$?

Building on the theory of embeddedness and social capital [10],
one would postulate that s is more likely to link to $v_1$ than to $v_2$.
However, the result of an edge $(u_1, u_2)$ is that when $\alpha > 0$, $v_2$
ends up with a higher PageRank score than $v_1$. This is somewhat
counterintuitive, as $v_1$ somehow seems "more connected" to s than
$v_2$. Can we remedy this in a natural way?

One solution could be that carefully setting $\alpha$ resolves the issue.
However, there is no value of $\alpha > 0$ which will make the
score of $v_1$ higher than that of $v_2$, and changing to other simple
teleporting schemes (such as a random jump to a random node) does not
help either. However, a simple correction that works is to add the
number of friends a node w has in common with s, and use this
as an additional feature $\gamma$ on each edge (u, w). If we apply this to
the graph shown in Figure 6 and set the weight along each edge to
$1 + \gamma$, then the PageRank score $p_{v_1}$ of node $v_1$ is 1.9 greater than
that of $v_2$ (as opposed to 0.1 smaller as in Fig. 6).
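The ordering of the two unweighted scores can be checked by hand on a small graph. The derivation below assumes a topology that the description suggests but does not spell out: s links to $u_1, u_2, u_3, u_4$; $u_1$ and $u_2$ both link to $v_1$; $u_3$ and $u_4$ both link to $v_2$; and the extra edge $(u_1, u_2)$ is present. All edges have unit strength, the walk restarts at s with probability $\alpha$, and $c = 1 - \alpha$. The actual graph of Figure 6 may differ.

```latex
% By symmetry write a = p_{u_1} = p_{u_2} and b = p_{u_3} = p_{u_4}.
% Stationarity of the walk (degrees: s has 4, u_1 and u_2 have 3, all others 2):
\begin{align*}
a &= c\left(\tfrac{1}{4}\,p_s + \tfrac{1}{2}\,p_{v_1} + \tfrac{1}{3}\,a\right), &
p_{v_1} &= \tfrac{2}{3}\,c\,a, \\
b &= c\left(\tfrac{1}{4}\,p_s + \tfrac{1}{2}\,p_{v_2}\right), &
p_{v_2} &= c\,b .
\end{align*}
% Eliminating a and b:
\begin{align*}
p_{v_1} = \frac{c^2\,p_s}{6 - 2c - 2c^2},
\qquad
p_{v_2} = \frac{c^2\,p_s}{4 - 2c^2}.
\end{align*}
% Both denominators are positive for 0 \le c \le 1, and
% (6 - 2c - 2c^2) - (4 - 2c^2) = 2(1 - c) = 2\alpha ,
% so p_{v_1} < p_{v_2} whenever 0 < \alpha < 1, with equality only at \alpha = 0.
```

Under these assumptions the edge $(u_1, u_2)$ diverts part of the flow that would otherwise reach $v_1$, and no choice of $\alpha$ in (0, 1) reverses the ordering, in line with the statement above.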

In practice, we find that introducing this additional feature $\gamma$
helps on the Facebook graph. In Facebook, the connection $(u_1, u_2)$
increases the probability of a link forming to $v_1$ by about 50%. In
the co-authorship networks, the presence of $(u_1, u_2)$ actually
decreases the link formation probability by 37%. Such behavior of
co-authorship networks can be explained by the argument that long
range weak ties help in access to new information [14] (i.e., s is
more likely to link to $v_2$ than $v_1$ of Fig. 6). Having two independent
paths is a stronger connection in the co-authorship graph, as this
indicates that s has written papers with two people, on two different
occasions, and both of these people have written with the target
v, also on two different occasions. Thus, there must be at least
four papers between these four people when the edge $(u_1, u_2)$ is
absent, and there may be as few as two when it is present. Note this
is exactly the opposite of the social capital argument [10], which
postulates that individuals who are well embedded in a network or
a community have higher trust and get more support and information.
This is interesting as it shows that Facebook is about social
contacts, norms, trying to fit in and be well embedded in a circle of
friends, while co-authorship networks are about access to information
and establishing long-range weak ties.

Figure 6: Stationary random walk distribution with $\alpha = 0.15$.

Figure 7: Performance of Supervised Random Walks as a function
of the number of steps of parameter estimation procedure.

5.2 Experiments on real data 

Next we evaluate the predictive performance of Supervised Ran- 
dom Walks (SRW) on real datasets. We examine the performance 
of the parameter estimation and then compare Supervised Random 
Walks to other link-prediction methods. 

Parameter estimation. Figure 7 shows the results of gradient descent
on the Facebook dataset. At iteration 0, we start with unweighted
random walks, by setting w = 0. Using L-BFGS we
perform gradient descent on the WMW loss. Notice the strong correlation
between AUC and WMW loss, i.e., as the value of the loss
decreases, AUC increases. We also note that the method basically
converges in only about 25 iterations.

Comparison to other methods. Next we compare the predictive
performance of Supervised Random Walks (SRW) to a number of
simple unsupervised baselines, along with two supervised machine
learning methods. All results are evaluated by creating two independent
datasets, one for training and one for testing. Each performance
value is the average over all of the graphs in the test set.

Learning Method            AUC      Prec@20
Random Walk with Restart   0.63831  3.41
Adamic-Adar                0.60570  3.13
Common Friends             0.59370  3.11
Degree                     0.56522  3.05
DT: Node features          0.60961  3.54
DT: Network features       0.59302  3.69
DT: Node+Network           0.63711  3.95
DT: Path features          0.56213  1.72
DT: All features           0.61820  3.77
LR: Node features          0.64754  3.19
LR: Network features       0.58732  3.27
LR: Node+Network           0.64644  3.81
LR: Path features          0.67237  2.78
LR: All features           0.67426  3.82
SRW: one edge type         0.69996  4.24
SRW: multiple edge types   0.71238  4.25

Table 2: Hep-Ph co-authorship network. DT: decision tree, LR:
logistic regression, and SRW: Supervised Random Walks.

Learning Method            AUC      Prec@20
Random Walk with Restart   0.81725  6.80
Adamic-Adar                0.81586  7.35
Common Friends             0.80054  7.35
Degree                     0.58535  3.25
DT: Node features          0.59248  2.38
DT: Network features       0.76979  5.38
DT: Node+Network           0.76217  5.86
DT: Path features          0.62836  2.46
DT: All features           0.72986  5.34
LR: Node features          0.54134  1.38
LR: Network features       0.80560  7.56
LR: Node+Network           0.80280  7.56
LR: Path features          0.51418  0.74
LR: All features           0.81681  7.52
SRW: one edge type         0.82502  6.87
SRW: multiple edge types   0.82799  7.57

Table 3: Results for the Facebook dataset.

Figure 8 shows the ROC curve for the Astro-Ph dataset, comparing
our method to an unweighted random walk. Note that much
of the improvement in the curve comes in the area near the origin,
corresponding to the nodes with the highest predicted values.
This is the area that we most care about, i.e., since we can only
display/recommend about 20 potential target nodes to a Facebook
user we want the top of the ranking to be particularly good (and do
not care about errors towards the bottom of the ranking).

We compare the Supervised Random Walks to unsupervised
link-prediction methods: plain Random Walk with Restarts,
Adamic-Adar score [1], number of common friends, and node degree. For
supervised machine learning methods we experiment with decision
trees and logistic regression and group the features used for
training them into three groups:

• Network features: unweighted random walk scores, Adamic-Adar
score, number of common friends, and degrees of nodes
s and the potential target $c \in C$

• Node features: average of the edge features for those edges
incident to the nodes s and $c \in C$, as described in Section 4

• Path features: averaged edge features over all paths between
seed s and the potential destination c.
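Two of the network features named in the list above have simple closed forms. The sketch below uses the standard definitions (number of shared neighbors, and the Adamic-Adar sum of inverse log-degrees over shared neighbors); the adjacency-set layout and function names are assumptions of this example.

```python
import math


def common_friends(adj, s, c):
    """Number of neighbors shared by s and c."""
    return len(adj[s] & adj[c])


def adamic_adar(adj, s, c):
    """Sum over shared neighbors z of 1 / log(degree of z).

    A shared neighbor is adjacent to both s and c, so for s != c its
    degree is at least 2 and the logarithm is strictly positive.
    """
    return sum(1.0 / math.log(len(adj[z])) for z in adj[s] & adj[c])
```

Adamic-Adar gives less credit to a mutual friend who has many connections, while the common-friends count weights all mutual friends equally.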



                          AUC               Prec@20
Dataset                   SRW      LR       SRW   LR
Co-authorship Astro-Ph    0.70548  0.67639  2.55  2.15
Co-authorship Cond-Mat    0.74173  0.71672  2.54  2.61
Co-authorship Hep-Ph      0.71238  0.67426  4.18  3.82
Co-authorship Hep-Th      0.72505  0.69428  2.59  2.61
Facebook (Iceland)        0.82799  0.81681  7.57  7.52

Table 4: Results for all datasets. We compare favorably to logistic
regression run on all features. Our Supervised Random
Walks (SRW) perform significantly better than the baseline in
all cases on ROC area. The variance is too high on the Top20
metric, and the two methods are statistically tied on this metric.

Figure 8: ROC curve of Astro-Ph test data (SRW and Random Walk with Restart).

Tables 2 and 3 compare the results of various methods on the
Hep-Ph co-authorship and Facebook networks. In general, we note
very good performance of Supervised Random Walks (SRW): AUC is in
the range 0.7-0.8 and precision at top 20 is between 4.2 and 7.6. We
consider this surprisingly good performance. For example, in the case
of Facebook this means that out of 20 friendships we recommend,
nearly 40% are realized in the near future.

Overall, Supervised Random Walks (SRW) give a significant improvement
over the unweighted Random Walk with Restarts (RWR).
SRW also gives gains over other techniques such as logistic regression
which combine features. For example, in the co-authorship
network (Tab. 2) we note that unsupervised RWR outperforms decision
trees and slightly trails logistic regression in terms of AUC
and Prec@20. Supervised Random Walks outperform all methods.
In terms of AUC we get 6% and in terms of Prec@20 near 12%
relative improvement. In Facebook (Tab. 3), Random Walk with
Restarts already gives near-optimal AUC, while Supervised Random
Walks still obtain 11% relative improvement in Prec@20.
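The quoted relative improvements can be reproduced from the table entries. The text does not state which baseline each percentage is measured against; the sketch below assumes logistic regression on all features for Hep-Ph and Random Walk with Restart for Facebook, which is the pairing that matches the quoted figures.

```python
def relative_improvement(new, old):
    """Relative gain of `new` over `old`, as a fraction."""
    return (new - old) / old


# Hep-Ph (Table 2): SRW with multiple edge types vs. LR on all features (assumed baseline).
hep_ph_auc = relative_improvement(0.71238, 0.67426)
hep_ph_prec = relative_improvement(4.25, 3.82)

# Facebook (Table 3): SRW with multiple edge types vs. Random Walk with Restart (assumed baseline).
facebook_prec = relative_improvement(7.57, 6.80)

for name, value in [("Hep-Ph AUC", hep_ph_auc),
                    ("Hep-Ph Prec@20", hep_ph_prec),
                    ("Facebook Prec@20", facebook_prec)]:
    print(f"{name}: {100 * value:.1f}%")
```

The three values come out at roughly 5.7%, 11.3% and 11.3%, consistent with the rounded 6%, 12% and 11% in the text.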

It is important to note that, in addition to outperforming the other
methods, Supervised Random Walks do so without the tedious process
of feature extraction. There are many network features relating
pairs of unconnected nodes (Adamic-Adar was the best out of the
dozens examined in [21], for example). Instead, we need only select
the set of node and edge attributes, and Supervised Random
Walks take care of determining how to combine them with the network
structure to make predictions.

Last, Table 4 compares the performance of the top two methods:
Supervised Random Walks and logistic regression. We note that
Supervised Random Walks compare favorably to logistic regression.
Logistic regression requires state-of-the-art network feature
extraction, whereas Supervised Random Walks outperform it out of the
box, without any ad hoc feature engineering.

When we examine the weights assigned, we find that for Facebook
the largest weights are those which are related to time. This
makes sense: if a user has just made a new friend u, she is
likely to have also recently met some of u's friends. In the
co-authorship networks, we find that the number of co-authored papers
and the cosine similarity amongst titles were the features with the
highest weights.

Runtime. While the exact runtime of Supervised Random Walks
is highly dependent on the graph structure and features used, we
give some rough guidelines. The results here are for single runs on
a single 2.3 GHz processor on the Facebook dataset.

When putting all edges in the same category, we have 8 weights
to learn. It took 98 iterations of the quasi-Newton method to converge
and minimize the loss. This required computing the PageRanks
of all the nodes in all the graphs (100 of them) 123 times,
along with the partial derivatives of each of the 8 parameters 123
times. On average, each PageRank computation took 13.2 steps
of power-iteration before converging, while each partial derivative
computation took 6.3 iterations. Each iteration for PageRank or
its derivative takes $O(|E|)$. Overall, the parameter estimation on the
Facebook network took 96 minutes. By contrast, increasing the
number of edge types to 6 (which gives the best performance) required
learning 48 weights, and increased the training time to 13 hours on
the Facebook dataset.

6. CONCLUSION 

We have proposed Supervised Random Walks, a new learning al- 
gorithm for link prediction and link recommendation. By utilizing 
node and edge attribute data our method guides the random walks 
towards the desired target nodes. Experiments on Facebook and co- 
authorship networks demonstrate good generalization and overall 
performance of Supervised Random Walks. The resulting predic- 
tions show large improvements over Random Walks with Restarts 
and compare favorably to supervised machine learning techniques 
that require tedious feature extraction and generation. In contrast, 
our approach requires no network feature generation and in a prin- 
cipled way combines rich node and edge features with the structure 
of the network to make reliable predictions. 

Supervised Random Walks are not limited to link prediction, and
can be applied to many other problems that require learning to rank
nodes in a graph, such as recommendations, anomaly detection,
missing links, and expertise search and ranking.